Data Processing in a Machine Learning Computer

ABSTRACT

A computer-implemented method of training a multi-layer neural network comprising a set of network weights, comprising: processing the training data in respective forward and backward passes through multiple layers, the forward pass comprising computing a set of activations in dependence on the network weights and training data, and the backward pass comprising: computing gradients of a pre-determined loss function with respect to the network weights and/or activations, wherein an adjustment parameter is applied to at least a subset of values in the neural network, the values comprising at least one of: the network weights, the activations, the gradients with respect to activations and the gradients with respect to weights; updating the network weights in dependence on the computed gradients; computing a proportion of the subset of values falling above a predefined threshold; and updating the adjustment parameter in dependence on the computed proportion.

CROSS-REFERENCE TO RELATED APPLICATIONS

The present application claims priority to U.S. Provisional Patent Application No. 63/265,436 filed Dec. 15, 2021, the disclosure of which is hereby incorporated herein by reference.

TECHNICAL FIELD

The present disclosure relates to processing data in a machine learning computer. Particularly, but not exclusively, this disclosure relates to processing of neural networks using mixed-precision numerical formats.

BACKGROUND

Deep neural networks are machine intelligence models used to perform a wide variety of tasks in fields such as computer vision (for example, object recognition) and natural language processing (for example, machine translation and natural language generation).

FIG. 1A illustrates an example machine intelligence model in the form of a neural network. As will be familiar to a person skilled in the art of machine intelligence, machine intelligence begins with a learning stage where the machine intelligence algorithm learns a knowledge model. The model may be represented as a graph 60 of interconnected nodes 102 and links 104. Nodes and links may be referred to as vertices and edges. Each node 102 in the graph has one or more input edges and one or more output edges, wherein some of the input edges of some of the nodes 102 are the output edges of some others of the nodes, thereby connecting together the nodes to form the graph. Further, one or more of the input edges of one or more of the nodes 102 form the inputs to the graph as a whole, and one or more of the output edges of one or more of the nodes 102 form the outputs of the graph as a whole. Each edge 104 communicates a value commonly in the form of a tensor (n-dimensional matrix), these forming the inputs and outputs provided to and from the nodes 102 on their input and output edges respectively.

Each node 102 represents a function of its one or more inputs as received on its input edge or edges, with the result of this function being the output(s) provided on the output edge or edges. These results are sometimes referred to as activations. Each function is parameterised by one or more respective parameters (sometimes referred to as weights, though they need not necessarily be multiplicative weights). In general the functions represented by the different nodes 102 may be different forms of function and/or may be parameterised by different parameters. In deep neural network architectures, nodes are arranged into layers, with nodes of each layer receiving tensors on the output edges of the previous layer, and communicating their own outputs to nodes of the next layer in the network.

FIG. 1B furthermore provides a simplified representation of an example node 102. Each node 102 represents a function of its inputs. Some nodes receive the inputs to the graph (in a multi-layer network, these nodes form the ‘input layer’) and some receive inputs from one or more other nodes (in a multi-layer network, these nodes are within ‘hidden layers’ of the network). The outputs of nodes in input and hidden layers form the inputs of nodes in the respective next layer. At a final ‘output layer’, the output of the nodes provides the output of the graph.

Further, the function at each node is parameterised by one or more respective parameters, e.g. weights 151, which are applied to the input activations to compute the input to the activation function 153, which generates the output activation.

The activation function 153 is configured to receive weighted input values and generate an output value based on the activation function. The activation function is typically attached to each node in the network and determines whether it should be activated (“fired”) or not, based on whether each node's input is relevant for the model's prediction. Certain activation functions, such as sigmoid or tanh, also help normalise the output of each node to a range, for example between 0 and 1 or between −1 and 1. Other activation functions, such as a rectified linear unit (ReLU), do not normalise the output.
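
As an illustration of the output ranges mentioned above, the following minimal Python sketch evaluates sigmoid, tanh and ReLU on a handful of inputs. The function names and the sample inputs are chosen here for illustration only and are not taken from the disclosure.

```python
import math

def sigmoid(x: float) -> float:
    """Logistic sigmoid; mathematically its output lies between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-x))

def relu(x: float) -> float:
    """Rectified linear unit; negative inputs map to 0, positive inputs pass through."""
    return max(0.0, x)

inputs = [-5.0, -1.0, 0.0, 1.0, 5.0]          # illustrative sample inputs
sigmoid_out = [sigmoid(x) for x in inputs]    # all values fall within (0, 1)
tanh_out = [math.tanh(x) for x in inputs]     # all values fall within (-1, 1)
relu_out = [relu(x) for x in inputs]          # unbounded above, not normalised
```

The sigmoid and tanh outputs stay within a fixed interval whatever the input, whereas the ReLU output grows with its input.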

In a standard deep neural network architecture, each node of a given layer is connected via a link 104 to every node of a subsequent layer. Networks with this all-to-all connectivity may be referred to as ‘fully connected’. In a convolutional neural network however, each node of a layer applies a ‘filter’ of weights (which may also be referred to as a kernel) in a sliding window to an input tensor to determine a weighted input to a node 102, where the filter only applies to a subset of input values to the given layer at a time. The subset of inputs that the filter ‘sees’ at a time may be referred to as the receptive field. Other common neural network architectures include recurrent neural networks and transformer architectures. Various implementations of these architectures exist in the art and will not be described further herein.

As described above, the output of each node or ‘neuron’ of a neural network depends on one or more parameters or weights applied to the set of inputs to that node. To train a neural network, the parameters at each layer are updated according to a learning scheme, to optimise a training goal. For example, where a goal is to train a network to identify object classes present in an input image, the output layer may be configured to output an indicator for a predicted class from among a set of possible classes, and the training goal may be to maximise an accuracy of the neural network's prediction for a set of input images where the classes of the objects within the images are known. In this context, deep neural networks are obtained by stacking multiple layers. The strength of these multi-layer architectures is that successive layers can reuse features that have been built by the first layers, and this reuse of features corresponds to an efficient implementation.

Learning is generally based on the iterative update of the parameters of each of the layers, typically through backpropagation. In practice, backpropagation based on gradient descent computes the gradient of the loss with respect to the output of the last layer, and then this gradient is backpropagated using the chain rule of calculus. With backpropagation, each layer receives the gradient of the loss with respect to its output, and uses this quantity to derive the gradient of the loss with respect to the parameters, the weights of that particular layer. These quantities are then used to update the corresponding weights.

Gradient descent methods are highly effective and widely used to train neural networks. However, a common problem when training very large deep learning models comprising up to millions or even billions of weights is that the memory required to store the weights, activations (where activations are the output values of each node) and gradients at each layer is significant.

One way to reduce the required memory usage when training a deep learning model is to choose a representation for weights, activations and/or gradients of the network such that each value occupies fewer bits of memory.

In computing, bit sequences of predefined sizes are used to represent numbers. The particular representation of the bit sequence determines how a bit sequence is interpreted. One form of representation is the floating-point representation, which is often used to approximately represent real numbers. The floating-point representation comprises three separate components, i.e. a sign bit s∈{0,1}, an m-bit mantissa with bits d_(i), i=1, . . . , m, and an e-bit exponent p, 0≤p<2^(e). In the single-precision (i.e. 32-bit) floating-point representation according to the IEEE 754 standard, the exponent consists of 8 bits, and the mantissa consists of 23 bits. In the half-precision (i.e. 16-bit) floating-point representation, the exponent consists of e=5 bits, and the mantissa consists of m=10 bits. In most cases, a floating-point number is given from these three components by the following formula:

$(-1)^{s} \, 2^{p - b} \left( 1 + \frac{d_{1}}{2} + \frac{d_{2}}{2^{2}} + \cdots + \frac{d_{m}}{2^{m}} \right)$

The exponent bias b shown in the formula offsets the representation of the exponent. This exponent bias is commonly given by b=2^(e−1)−1 and is dependent on the number of bits e used to represent the exponent for the given floating-point format. In the single-precision representation, the exponent bias is equal to 2⁷−1=127. In the half-precision format, the exponent bias is equal to 2⁴−1=15.

As shown in the above formula, the representation of the mantissa typically relies on an implicit bit, which is derived from the exponent. In the case where the exponent bit sequence consists of anything other than all zeros or all ones, the implicit bit is equal to 1 and the number is known as a “norm”. In this case, the floating-point number is given by:

$(-1)^{s} \, 2^{p - b} \left( 1 + \frac{d_{1}}{2} + \frac{d_{2}}{2^{2}} + \cdots + \frac{d_{m}}{2^{m}} \right)$

In the case that the exponent bit sequence consists of all zeros, the implicit bit is equal to 0 and the number is known as a “denorm”. In this case, the floating-point number is given by:

$(-1)^{s} \, 2^{1 - b} \left( 0 + \frac{d_{1}}{2} + \frac{d_{2}}{2^{2}} + \cdots + \frac{d_{m}}{2^{m}} \right)$

The denorms are useful, since they allow smaller numbers to be represented than would otherwise be representable by the limited number of exponent bits.

The other circumstance, in which the exponent bit sequence consists of all ones, may be used to represent special cases, e.g. ±infinity or NaN (not a number). NaN is a numeric data type value representing an undefined or unrepresentable value. The presence of a NaN in the results of a calculation is often taken to signal an exception.
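
The norm, denorm and special-case rules above can be sketched in a few lines of Python. The function name decode_float and its parameters are hypothetical and chosen here for illustration; the sketch assumes the common bias b=2^(e−1)−1 and the IEEE 754 convention that an all-ones exponent with a zero mantissa denotes infinity and with a non-zero mantissa denotes NaN.

```python
import math
import struct

def decode_float(bits: int, e: int, m: int) -> float:
    """Decode a (1 + e + m)-bit pattern using the norm/denorm formulas."""
    bias = 2 ** (e - 1) - 1                      # exponent bias b = 2^(e-1) - 1
    sign = (bits >> (e + m)) & 1                 # sign bit s
    p = (bits >> m) & ((1 << e) - 1)             # stored exponent p
    frac = (bits & ((1 << m) - 1)) / (1 << m)    # d_1/2 + d_2/2^2 + ... + d_m/2^m
    if p == (1 << e) - 1:                        # all ones: infinity or NaN
        value = math.inf if frac == 0 else math.nan
    elif p == 0:                                 # all zeros: denorm, implicit bit 0
        value = 2.0 ** (1 - bias) * frac
    else:                                        # norm, implicit bit 1
        value = 2.0 ** (p - bias) * (1.0 + frac)
    return -value if sign else value

# Cross-check a few half-precision patterns against Python's struct module,
# whose "e" format is IEEE 754 binary16.
for bits in (0x0001, 0x03FF, 0x0400, 0x3C00, 0x7BFF, 0xC000):
    reference = struct.unpack("<e", struct.pack("<H", bits))[0]
    assert decode_float(bits, e=5, m=10) == reference
```

The pattern 0x0001 is the smallest half-precision denorm, 0x0400 the smallest norm, and 0x7BFF the largest finite value, so the cross-check exercises both formulas.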

Another form of representation is the integer representation. The integer may be signed, in which case a single bit of the bit sequence is used to represent the sign of the number, with the remaining bits of the bit sequence used to represent the magnitude of the number. Another common representation for signed integers is two's complement representation. Alternatively, the integer may be unsigned, in which case all of the bits of the bit sequence are used to represent the magnitude of the number.

Floating-point representation is used to represent numbers in most current implementations of neural network processing.

A standard floating-point representation FP32, known as single-precision floating-point format, uses 32 bits in memory and can represent a very large range of numbers (from ~10⁻³⁸ to ~10³⁸). Lower-precision formats using 16 bits (FP16) or even 8 bits (FP8) represent a significant reduction in memory usage and computational cost, particularly when used for representing a deep learning model with up to millions or billions of parameters. However, using fewer exponent and mantissa bits to represent a number leads to a reduction in the range and/or precision of representable values. Lower-precision formats can lead to two possible problems when applying arithmetic operations to numbers stored in this format: numerical underflow and overflow. Underflow occurs when the absolute value of a number is too small to be represented in the chosen number format. Where an arithmetic operation gives a result which is too small to be represented in the chosen floating-point format, leading to underflow, the number will instead be represented as zero. Numerical overflow occurs when the absolute value of a number is too large to represent in the chosen format. In this case, the number may be represented as positive or negative infinity, or be saturated (‘clipped’) to the maximum positive or negative number that can be represented by the chosen format. As mentioned above, for FP32 numerical overflow only occurs for numbers with absolute values greater than ~10³⁸. However, for FP8, numerical overflow occurs at much lower absolute values, and overflow could occur when training a neural network, if weights, activations or gradients grow sufficiently large.
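
Underflow and overflow can be observed directly with a 16-bit format. The following minimal sketch uses the half-precision "e" format of Python's standard struct module; the helper to_fp16 and its saturate flag are hypothetical names introduced here, and the choice between saturation and infinity is made by the sketch, not prescribed by the disclosure.

```python
import math
import struct

FP16_MAX = 65504.0  # largest finite IEEE 754 half-precision value

def to_fp16(x: float, saturate: bool = True) -> float:
    """Round x to half precision and return the result as a Python float."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:
        # struct refuses finite values that are too large for the format;
        # represent them by saturation or by infinity, as selected.
        limit = FP16_MAX if saturate else math.inf
        return math.copysign(limit, x)

print(to_fp16(1e-8))                    # underflow: stored as zero
print(to_fp16(1e5))                     # overflow: saturated to FP16_MAX
print(to_fp16(-1e5, saturate=False))    # overflow: represented as -infinity
print(to_fp16(0.1))                     # in range, but only approximately 0.1
```

The last line illustrates the loss of precision: 0.1 is within range but is stored as the nearest representable half-precision value.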

Some methods of offsetting the effects of overflow and underflow for lower-precision numerical formats use the concept of exponent bias.

Another way of reducing instances of numerical underflow and overflow in lower-precision formats is to apply a scaling factor to variables which are likely to take very small or very large values and are therefore prone to underflow or overflow. Applying a scaling factor to a loss function, which in turn scales the gradients of the loss function used in training the network, is referred to as loss scaling. Values of weights and activations may also be scaled by adjusting the exponent bias term applied to the floating-point representations. A challenge is to choose the scale of the gradients so as to minimise both underflow and overflow and maximise the accuracy of the representations.

SUMMARY

A first aspect disclosed herein provides a computer-implemented method of training, based on a set of training data, a multi-layer neural network comprising a set of network weights, the method comprising: processing the training data in respective forward and backward passes through a sequence of layers of the network, the forward pass comprising computing a set of activations by applying an activation function in dependence on the network weights and training data, and the backward pass comprising: computing gradients of a pre-determined loss function with respect to the network weights and/or computing gradients of the pre-determined loss function with respect to the computed activations of the network, wherein an adjustment parameter is applied to at least a subset of values in the neural network, the values comprising at least one of: the network weights, the activations computed in the forward pass, the gradients with respect to activations computed in the backward pass, and the gradients with respect to weights computed in the backward pass; updating the network weights in dependence on the computed gradients with respect to the weights; computing a proportion of the subset of values falling above a predefined threshold; and updating the adjustment parameter applied to the subset of machine learning parameters in dependence on the computed proportion.

It should be noted that both the terms ‘signal’ and ‘value’ are used herein to refer collectively to the weights, activations and gradients of the network.

BRIEF DESCRIPTION OF FIGURES

For a better understanding of the present disclosure, and to show how embodiments of the same may be carried into effect, reference is made by way of example only to the following figures in which:

FIG. 1A illustrates an example machine intelligence model in the form of a neural network.

FIG. 1B provides a simplified representation of an example node.

FIG. 1C shows an extremely simplified version of one arrangement of nodes in a neural network.

FIG. 2 shows a schematic block diagram for training a neural network by gradient descent.

FIG. 3 shows how clipping may be applied when weights exceed the maximum representable value for a number format.

FIG. 4 shows an example of how a loss scaling factor may be applied during training of a neural network.

FIG. 5 shows how a histogram of values may be used to select a scaling factor automatically.

FIG. 6 shows a flow diagram of updating a loss scaling factor.

DETAILED DESCRIPTION

Certain factors should be considered when selecting a numerical format with which to represent data, for example weights and activations of a deep learning model, including computational and communication efficiency, as well as accuracy. As described above, a standard floating-point representation FP32, known as single-precision floating-point format, uses 32 bits in memory, and can represent a very large range of numbers. Lower-precision floating-point formats occupy less space in memory than single-precision floating-point numbers. An 8-bit floating-point representation comprising 1 sign bit, 5 exponent bits and 2 mantissa bits occupies only 8 bits of memory compared to a number represented in 32-bit format, which occupies 32 bits, or 4 bytes of memory. However, 8-bit floating-point formats or FP8 formats have the cost of a narrower range and lower precision of representable values, making underflow and overflow more likely.

Deep learning models are usually trained using gradient descent methods. These are described in more detail later, but in general these use a technique known as backpropagation in which a gradient with respect to weights at a given layer of the network is determined as a function of the activations and gradients with respect to activations at the following layer of the network. For many deep learning models these functions include matrix multiplications. This may lead to underflow when many small quantities are multiplied together, for example as the result of the successive multiplication of gradients during backpropagation. Numerical underflow thus causes an issue when small gradients occur in a network, as the gradients cannot be accurately computed and propagated through the network. Numerical overflow may also occur, typically when weights or activations grow too large during training to be accurately represented in the chosen format. While overflow can in theory also occur for gradients, for example when many large gradients are multiplied together during backpropagation, in practice gradients take smaller values than weights and activations on average.

Some methods of offsetting the effects of overflow and underflow for lower-precision numerical formats use the concept of exponent bias. Standard floating-point numbers use a fixed exponent bias b in order to store the exponent of the floating-point number as an unsigned value, such that when the bias is applied the exponent can have positive or negative values. For example, standard single-precision floating-point numbers have exponent values in the range −126 to +127 once the exponent bias b has been applied. It should be noted that applying a negative bias b to the exponent of a floating-point representation shifts the representable values down, which provides an equivalent effect to multiplying the number by a scaling factor equal to 2^(b). Therefore, it is possible to effectively change the representable range of a floating-point number either by multiplying the number by a scaling factor directly, or by adding or subtracting a bias to/from the exponent of the floating-point representation of that number. Scale factors and exponent biases may be referred to herein as adjustment parameters as their application to floating-point representations of weights, activations and gradients of a model can be used to adjust the range of representable values according to the data.
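
The equivalence between changing the exponent bias and applying a power-of-two scaling factor can be checked numerically. The sketch below assumes the convention of the formulas given earlier, in which the bias is subtracted from the stored exponent, so that adding k to the bias multiplies every decoded value by 2^(−k); other sign conventions are possible. The names decode_with_bias, SHIFT and the 1.5.2 bit layout used for the loop are illustrative assumptions.

```python
def decode_with_bias(bits: int, e: int, m: int, bias: int) -> float:
    """Decode a (1 + e + m)-bit pattern with an explicit exponent bias.

    For simplicity the all-ones exponent is treated as an ordinary norm
    here rather than as infinity or NaN.
    """
    sign = (bits >> (e + m)) & 1
    p = (bits >> m) & ((1 << e) - 1)
    frac = (bits & ((1 << m) - 1)) / (1 << m)
    if p == 0:
        value = 2.0 ** (1 - bias) * frac           # denorm
    else:
        value = 2.0 ** (p - bias) * (1.0 + frac)   # norm
    return -value if sign else value

E, M = 5, 2                        # 1 sign bit, 5 exponent bits, 2 mantissa bits
DEFAULT_BIAS = 2 ** (E - 1) - 1    # 15
SHIFT = 4                          # hypothetical adjustment to the bias

# Every 8-bit pattern decodes to the same value whether the bias is raised
# by SHIFT or the default-bias value is multiplied by 2^(-SHIFT).
for bits in range(256):
    shifted = decode_with_bias(bits, E, M, DEFAULT_BIAS + SHIFT)
    scaled = decode_with_bias(bits, E, M, DEFAULT_BIAS) * 2.0 ** (-SHIFT)
    assert shifted == scaled
```

Because the scaling factor is a power of two, the comparison is exact rather than approximate.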

As mentioned above, one way of reducing instances of numerical underflow and overflow in lower-precision formats is to apply a scaling factor to variables which are likely to take very small or very large values and are therefore prone to underflow or overflow, for example by applying a multiplicative factor to scale gradients of the network in order to perform computations which do not result in underflow.

Note that, as described above, deep learning models may be trained by computing gradients with respect to a loss function and updating the model parameters based on the computed gradients. Therefore, applying a constant scaling factor to the loss function is equivalent to applying a constant scaling factor to the gradients of the loss function. Herein a scaling factor may be referred to as a ‘loss scaling factor’, but it should be noted that this is the same as multiplying the gradient of the loss function by the same factor.

Described below is a method of scaling gradients of a deep neural network during training in dependence on the gradient statistics to enable gradients to be stored in a lower-precision format. An overview of neural networks and gradient-based training methods will first be provided.
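
A small numerical sketch may help make the effect of loss scaling concrete. It uses a one-weight squared-error loss whose gradient is known in closed form, and the half-precision "e" format of Python's struct module as a stand-in for a low-precision format. The values of w, x, y and SCALE are hypothetical and were picked so that the unscaled gradient underflows.

```python
import struct

def to_fp16(x: float) -> float:
    """Round x to half precision (the values used here are within range)."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

def grad_loss(w: float, x: float, y: float, scale: float = 1.0) -> float:
    """d/dw of scale * (w*x - y)^2, the gradient of a scaled squared error."""
    return scale * 2.0 * (w * x - y) * x

w, x, y = 0.5, 1e-3, 0.49e-3     # hypothetical values giving a tiny gradient
SCALE = 1024.0                   # hypothetical loss scaling factor

true_grad = grad_loss(w, x, y)                       # about 2e-8
unscaled = to_fp16(true_grad)                        # underflows to 0.0
scaled = to_fp16(grad_loss(w, x, y, scale=SCALE))    # representable in FP16
recovered = scaled / SCALE                           # unscale in higher precision
```

Without scaling, the gradient is lost entirely; with scaling, dividing the stored value by the same factor recovers the gradient up to rounding error.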

FIG. 1C shows an extremely simplified version of one arrangement of nodes in a neural network. This type of arrangement is often used in learning or training and comprises an input layer of nodes, a hidden layer of nodes and an output layer of nodes. In reality, there will be many nodes in each layer. Each node of the input layer N_(i) is capable of producing at its output an activation or node value which is generated by carrying out a function on data provided to that node. A vector of node values from the input layer is scaled by a vector of respective weights at the input of each node in the hidden layer, each weight defining the connectivity of that particular node with its connected node in the hidden layer. In practice, networks may have millions of nodes and be connected multi-dimensionally, so the vector is more often a tensor. The weights applied at the inputs of the node N_(h) are labelled w₀, . . . , w₂. Each node in the input layer is connected at least initially to each node in the hidden layer. Each node in the hidden layer can perform an activation function on the data which is provided to it and can similarly generate an output vector which is supplied to each of the nodes N_(o) in the output layer. Each node weights its incoming data, for example by carrying out the dot product of the input activations of the node and its unique weights for the respective incoming links. It then performs an activation function on the weighted data. The activation function can be for example a rectified linear unit (ReLU). The network learns by operating on data input at the input layer, assigning weights to the activations from each node and acting on the data input to each node in the hidden layer (by weighting the input data and performing the activation function). Thus, the nodes in the hidden layer operate on the weighted data and supply outputs to the nodes in the output layer.

During training, the weights are corrected using an error signal aiming at minimizing a selected loss function. This signal is typically provided by the gradient of the loss with respect to the weights. There are different learning approaches, but in each case there is a forward propagation through the network from left to right in FIG. 1C, a calculation of the gradient of the loss, and a backward propagation of the gradient from right to left in FIG. 1C through the network. In the next cycle, each node takes into account the back-propagated gradient and produces a revised set of weights. In this way, the network can be trained to perform its desired operation.
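
The forward propagation described above, in which each node forms the dot product of its input activations and its weights and then applies an activation function, can be sketched as follows. The layer sizes, the weight values and the helper dense_layer are hypothetical; the sketch mirrors the three-input, single-hidden-layer arrangement only loosely.

```python
def relu(z: float) -> float:
    return max(0.0, z)

def dense_layer(inputs, weights, activation):
    """Each row of `weights` holds one node's weights (w0, w1, ...);
    the result is one activation per node."""
    return [activation(sum(w * a for w, a in zip(row, inputs)))
            for row in weights]

inputs = [1.0, -2.0, 0.5]                  # activations from the input layer
hidden_w = [[0.2, -0.1, 0.4],              # hypothetical weights of hidden node 1
            [-0.3, 0.5, 0.1]]              # hypothetical weights of hidden node 2
output_w = [[1.0, -1.0]]                   # hypothetical weights of the output node

hidden = dense_layer(inputs, hidden_w, relu)            # forward through hidden layer
output = dense_layer(hidden, output_w, lambda z: z)     # linear output node
```

Here the second hidden node receives a negative weighted input and is therefore set to zero by the ReLU, so only the first hidden node contributes to the output.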

Machine learning models can comprise up to millions or billions of parameters and can require significant amounts of training data to provide good performance. Thus, computing resources required for machine learning models are significant, both in terms of memory for storing parameters and intermediate data, as well as computing power to carry out arithmetic operations on large numbers of variables at once. One way to reduce the computational cost of processing large amounts of data is to use a lower-precision numerical format to represent weights and activations of the network, as well as gradients of the loss function which are used to compute updates in training.

Low-precision floating-point formats have a limited range of numbers which can be represented compared with single-precision floating-point, which uses 32 bits to represent numbers spanning a range of absolute values, from 10⁻³⁸ to 10³⁸. Throughout training of a neural network, the scale of weights, activations and gradients may vary significantly such that a relatively large range of scales needs to be represented.

A standard method of training a neural network using gradient descent will now be described with reference to FIG. 2. A neural network comprises a set of n layers, each of which comprises a set of ‘neurons’. Each neuron in a layer computes a function of an input array to that layer based on the application of the layer parameters or weights, and applies a non-linear activation function such as a ReLU function. The resulting activations from different neurons are fed forward as the input array to the next layer. This may be referred to as a forward pass. Various types of neural networks exist including fully-connected neural networks, convolutional neural networks (CNNs) and recurrent neural networks (RNNs). These are known in the art and their individual features will not be described herein.

The goal of learning is to arrive at a set of network weights that minimise some training objective. At a final layer of the network, a prediction is output, which may depend on the task the network is designed to perform. For example, for an image processing task, where the input is an image containing an object, the network may be structured to output a predicted class of the object, given a set of possible classes. Typically, a network is trained by providing a set of training data for which the correct output is known, and defining a loss function 100 which measures the cost of using the network prediction for a given input instead of the ‘correct prediction’ corresponding to that input. The network weights may be initialized based on random values with certain statistics. However, during training the network weights are updated so as to minimise the loss function 100, i.e. to make the network predictions as close as possible to the ‘correct’ predictions.

A common optimisation scheme used to minimise the loss function is gradient descent. According to gradient descent, a gradient of the loss function may be computed with respect to the weights of the network, and each weight may be updated in the opposite direction to its respective component of the gradient, therefore ‘pushing’ the weights in the direction of minimal loss. The gradient with respect to the activations may also be computed as an intermediate step before computing the gradient of the loss function 100 with respect to the weights. Since the activations are a function of the weights, and the loss function 100 is a function of the activations, applying the chain rule allows the gradient with respect to the weights to be calculated based on the gradient with respect to the activations ∇_(A). The gradient calculation 406 is shown in FIG. 2. The loss function 100 is provided, along with the weights and activations 404 of the final layer in order to compute the gradient of the loss with respect to the weights of that layer. The results of these gradients are passed back through the earlier layers through backpropagation, where the gradients of the earlier layers are computed as a function of the gradients 402 in later layers using the chain rule.

Backpropagation is well-understood in the art and therefore will not be described in further detail herein.
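
The use of the chain rule described above can be illustrated for a single neuron. In the following minimal sketch, which assumes a ReLU activation and a squared-error loss chosen here purely for illustration, the gradient with respect to the activation is computed first and then used to obtain the gradient with respect to each weight.

```python
def forward(w, x):
    """Return the weighted input z and the ReLU activation a of one neuron."""
    z = sum(wi * xi for wi, xi in zip(w, x))
    return z, max(0.0, z)

def loss_fn(a: float, target: float) -> float:
    return (a - target) ** 2

def backward(w, x, target):
    """Chain rule: dLoss/dw_i = dLoss/da * da/dz * dz/dw_i."""
    z, a = forward(w, x)
    grad_a = 2.0 * (a - target)                   # gradient w.r.t. the activation
    grad_z = grad_a * (1.0 if z > 0 else 0.0)     # back through the ReLU
    grad_w = [grad_z * xi for xi in x]            # gradient w.r.t. each weight
    return grad_a, grad_w

w, x, target = [0.5, -0.25], [2.0, 1.0], 1.0      # hypothetical values
grad_a, grad_w = backward(w, x, target)
```

In a multi-layer network the same pattern repeats layer by layer, with each layer receiving the gradient with respect to its output from the layer after it.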

As mentioned above, the weights may be updated so as to ‘push’ the weights in the direction of the negative gradient. In other words, a term proportional to the component of the gradient corresponding to the given weight may be subtracted from the current value of the weight as follows:

$w_{i} = w_{i} - \eta\,\frac{\partial(\mathrm{Loss})}{\partial w_{i}} \qquad (1)$

where η is a learning rate controlling the size of the update. This is shown by the weight update 408 applied at each layer.

Note that the term ‘gradient’, while technically referring to a vector of partial derivatives, is used more generally herein to refer to both the gradient with respect to a weight or activation vector and the corresponding partial derivative components, computed with respect to a single weight or activation of the network. In other words, ‘gradient’ herein can refer to either individual components of a vector of partial derivatives or to the vector itself. A reference to the magnitude of gradients refers to the magnitude of individual partial derivatives of the loss function with respect to a given weight or activation.

Depending on the form of gradient descent used, the weights may be updated based on one training example at a time, or more commonly based on an aggregated gradient computed for a subset of the training examples, which may be referred to as a minibatch. In this case, an accumulation operation is applied to get an aggregated (e.g. average) gradient to be applied in the respective weight update. Each layer updates its respective weights based on the respective gradients with respect to the weights at that layer, as shown by the multiple weight updates 408 in FIG. 2. The updated weights are then used in the forward pass for a next iteration of training.

The techniques described below provide a way to automatically scale gradients of a neural network based on their statistics, where references to gradients of the network herein include both gradients with respect to weights and gradients with respect to activations.
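
A weight update that subtracts the learning rate times an averaged minibatch gradient can be sketched as below. The one-weight model, the squared-error loss, the data and the learning rate are hypothetical and serve only to show the accumulation and update steps.

```python
def minibatch_step(w: float, batch, lr: float) -> float:
    """One update w <- w - lr * (mean gradient) for the loss (w*x - y)^2."""
    grads = [2.0 * (w * x - y) * x for x, y in batch]   # per-example gradients
    mean_grad = sum(grads) / len(grads)                 # accumulation over the minibatch
    return w - lr * mean_grad

batch = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]   # hypothetical data following y = 2x
w = 0.0
for _ in range(50):                            # the updated weight feeds the next iteration
    w = minibatch_step(w, batch, lr=0.05)
```

With this data the weight converges towards 2, the value that makes the loss zero for every example in the minibatch.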

Low Precision Formats

An issue with storing weights, activations and gradients in a low-precision floating-point format, such as FP16 or FP8, is that the weights or activations and the gradients often take on a wide range of values. Weights and activations may grow beyond the range of numbers representable in these formats, and the magnitude of gradients may fall below the lowest representable non-zero value.

As mentioned above, floating-point numbers may be represented by a sign bit, a number of mantissa bits, and a number e of exponent bits. One example 8-bit format uses 5 exponent bits and 2 mantissa bits. This may be referred to as 1.5.2 format.

In addition to numbers falling outside the representable range of values, using a floating-point format representation means that numbers are not represented continuously. Only numbers which can be represented as a sum of powers of two (with the range of exponents dictated by the number of exponent bits) have an exact representation in the chosen floating-point format. In a simple example, where the smallest number representable by the exponent bits is 2⁻²=0.25 then any number between, e.g. 2 and 2.25 in decimal cannot be expressed accurately in this format and would therefore be rounded to the nearest of these two values. The resulting representation error is referred to as rounding noise. For weights, activations and/or gradients of a neural network, it is important that the format is chosen such that the corresponding numbers can be represented with low quantization noise, where quantization noise includes both saturation errors and rounding errors, i.e., limiting the loss of accuracy due to saturation and underflow and the loss of accuracy due to rounding noise.
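
Rounding noise can be made concrete with a short sketch that rounds a number to the nearest value having a given number of mantissa bits. The helper round_to_mantissa is a hypothetical name, the limited exponent range is deliberately ignored, and a 3-bit mantissa is assumed in the first two calls because it gives a spacing of 0.25 between 2 and 4, matching the simple example above.

```python
import math

def round_to_mantissa(x: float, m: int) -> float:
    """Round x to the nearest value with an m-bit mantissa (exponent range ignored)."""
    if x == 0.0:
        return 0.0
    _, exp = math.frexp(x)               # x = f * 2**exp with 0.5 <= |f| < 1
    step = 2.0 ** (exp - 1 - m)          # spacing of representable values near x
    return round(x / step) * step        # nearest multiple of the spacing

print(round_to_mantissa(2.1, 3))   # rounds down to 2.0
print(round_to_mantissa(2.2, 3))   # rounds up to 2.25
print(round_to_mantissa(2.3, 2))   # with 2 mantissa bits the spacing is 0.5: gives 2.5
```

The difference between the input and the returned value is the rounding error for that number; it grows with the magnitude of the number because the spacing doubles at each power of two.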

As neural networks are trained, the weights are updated and the activations and gradients are recomputed at each of a set of training iterations. During this process, some weights and activations may grow comparatively large. If a weight or activation falls beyond the upper limit of representable numbers, there is no way to store the correct value in the given format. Therefore, a process known as saturation or ‘clipping’ may be applied. FIG. 3 shows how a weight may fall outside of a range of representable values during training. At the top of FIG. 3, a schematic line is shown which represents the range of numbers which are representable by a chosen numerical format. A smallest non-zero number 200 represents the smallest absolute value which can be represented in the given representation, while a largest number 202 represents the largest absolute value that can be represented in the chosen format. For simplicity, positive numbers will be used. However, an identical process occurs for very large negative numbers. A given weight w of the network is shown close to the upper end of the range of representable numbers. After a weight update 408 is applied, the value of the new weight w′ may increase beyond the range of representable values. This may be referred to as ‘overflow’. To continue processing on the weights, a clipping 206 may be used to represent w′ as the largest representable value instead of its actual value.
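
The clipping step can be sketched as follows. The helper max_finite assumes, as in IEEE 754, that the all-ones exponent is reserved for infinity and NaN and that the bias is 2^(e−1)−1; the disclosure does not state this for the 1.5.2 format, so the resulting limit is an assumption of the sketch, as are the weight values used.

```python
def max_finite(e: int, m: int) -> float:
    """Largest finite value when the all-ones exponent is reserved for inf/NaN."""
    bias = 2 ** (e - 1) - 1
    largest_exponent = (2 ** e - 2) - bias
    return 2.0 ** largest_exponent * (2.0 - 2.0 ** (-m))

def clip(value: float, limit: float) -> float:
    """Saturate value to the interval [-limit, +limit]."""
    return max(-limit, min(limit, value))

FP8_152_MAX = max_finite(5, 2)        # 57344.0 under the assumptions above
w = 57000.0                           # weight close to the upper end of the range
w_new = w + 1000.0                    # hypothetical update pushing w' out of range
w_stored = clip(w_new, FP8_152_MAX)   # w' is stored as the largest representable value
```

The same function handles large negative values, which are clipped to the negative limit.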

For gradients, which are more likely than weights or activations to take on smaller values, a common problem is underflow, wherein the value of the gradient is too small to be represented accurately. This may be mitigated by applying a scaling factor to the gradients (or to the loss function from which the gradients are computed), in a process known as loss scaling. In this case, the scale of the gradients may be increased so that the gradients can be represented effectively in a low-precision format while carrying out expensive computations such as matrix multiplications and convolutions, and the results may be scaled back down by the same factor afterwards. However, applying too large a scaling factor may cause overflow, and therefore may cause some gradients to be clipped.
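The effect of loss scaling on underflow and overflow can be illustrated with Python's standard `struct` module, whose `"e"` format code packs IEEE half-precision values. This is a sketch under assumed values: the gradient magnitudes and the scale of 1024 are invented, and `to_fp16` and `overflows_fp16` are hypothetical helper names.

```python
import struct

def to_fp16(x: float) -> float:
    """Round a Python float to the nearest FP16 value and convert it back."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

def overflows_fp16(x: float) -> bool:
    """True if x is too large in magnitude to be packed as a finite FP16 value."""
    try:
        struct.pack("<e", x)
    except OverflowError:
        return True
    return False

grad = 1e-8      # too small for FP16: underflows to zero
scale = 1024.0   # assumed loss scaling factor

unscaled = to_fp16(grad)                  # stored without scaling
recovered = to_fp16(grad * scale) / scale # stored scaled, then scaled back down

print(unscaled, recovered)
print(overflows_fp16(100.0 * scale))      # a large gradient overflows at this scale
```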

Adaptive Loss Scaling for the Backward Pass

One method of loss scaling identifies when a loss scaling factor should be reduced based on when clipping events are observed. This may be referred to as ‘Backoff scaling’, as described, for example, in the Nvidia OpenSeq2Seq documentation, in a section titled ‘Mixed Precision Training’ (https://nvidia.github.io/OpenSeq2Seq/html/mixed-precision.html). Under this method, the loss scaling factor may be increased until a gradient becomes large enough to be clipped, at which point the loss scaling factor is ‘backed off’ to a lower level, from which it progressively increases until the next clipping event. This is based on the premise that clip events need to be avoided.

However, the inventors have recognised that neural network training has some tolerance of a small amount of clipping of gradients, and that there may not be a requirement to avoid all clipping events. They have recognised that a better performance may be obtained by reducing or increasing the loss scaling factor in dependence on the statistical properties of the gradients, for example a proportion of the gradients that fall above a certain threshold which indicates that a saturation of the upper end of the representable range has occurred. It should be noted that ‘proportion’ herein may refer to any relative count of gradients with respect to an overall set of gradients, and is not necessarily limited to a percentage. A software-implemented program will now be described which collects statistics for the gradients of the network and updates a loss scaling factor in dependence on the statistics in order to provide an optimal representation of gradients for a given format. It should be noted that the computer program may also be configured to collect statistics and adjust the format for the forward pass, which is described in more detail later.

FIG. 4 shows how a neural network may be trained while applying a loss scaling factor L to the loss function 100 based on gradient statistics. As described for FIG. 2, the network processes training data in a forward pass through a series of layers 402, at the end of which a loss function 100 is defined. In this case the loss function is multiplied by a loss scaling factor 420 to obtain a scaled loss function 416. The loss scaling factor may be initialised as any value. Gradients are then computed for the scaled loss with respect to the weights and activations. This means that the gradients 406 computed with respect to weights and activations at each layer are scaled up or down by the same loss scaling factor 420. The gradients are propagated back through the network in a backward pass.

At each loss scaling update step, statistics of the computed gradients are determined. A loss scaling update step may occur, for example, at every hundred training iterations. The frequency at which gradient statistics and loss scaling factor updates are computed may be user-configurable. The statistics may be computed based on a set of one or more thresholds; for example, a histogram of the gradients at each layer falling into each of a set of bins defined by bin edge thresholds 422 may be determined. The statistics may be accumulated by an accumulation operation 412, for example by summing the histograms for each layer. The accumulation may aggregate the statistics for only a subset of the layers of the network.

Note that the gradients may be accumulated over all layers in accumulation 412 with a single threshold 422 applied to determine the loss scaling factor, or a separate threshold 422 may be applied at each layer to determine a proportion above each threshold at each layer, before aggregating the computed proportions in the accumulation operation 412.

After the accumulated statistics are determined, a loss scaling algorithm 414 is applied to update the loss scaling factor based on the statistics. For example, where a histogram of two bins is computed, and the number of gradients in one of the bins is above some predefined proportion of the total number of gradients, then the scaling factor is reduced so as to avoid too many gradients reaching the upper end of the representable range, which would result in too many clipping events to maintain good performance. This is described in more detail with reference to FIG. 5.

The weights are updated within an optimiser 416 based on an accumulation 410 of the gradients computed for the given layer over a minibatch of training data. The optimiser 416 applies the weight update according to a gradient update rule such as equation (1). One example optimisation algorithm used in the field of machine learning is the Adam optimiser, which applies a particular type of stochastic gradient descent update. Other gradient-based optimisation algorithms are known in the art, any of which may be used to train a neural network according to the method shown in FIG. 4. To correct the scaling factor, any gradient multiplied by L at the end of the forward pass should be divided by L before the weight update is applied, to ensure that the weight is updated by an appropriate amount. The weight update 408 is shown in FIG. 4, with the loss scaling factor L being provided for scaling the weight update. The method of re-scaling the loss may depend on the format in which the weight updates are processed. For example, the gradients may be simply divided by L before processing by the optimiser. Alternatively, depending on the optimiser, the loss scaling factor may be absorbed by part of the computation carried out in the optimiser to obtain the correction to be applied in the weight update. Optimisers are known in the art and will not be discussed further herein.
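The simplest re-scaling option described above, dividing the scaled gradients by L before the optimiser uses them, might be sketched as follows. Plain stochastic gradient descent is assumed here purely for illustration (it is not necessarily the update rule of the described method), and `sgd_step` and its learning rate are hypothetical.

```python
def sgd_step(weights, scaled_grads, loss_scale, lr=0.1):
    """Undo the loss scaling, then apply a plain SGD update (illustrative only)."""
    inv_scale = 1.0 / loss_scale
    return [w - lr * (g * inv_scale) for w, g in zip(weights, scaled_grads)]

true_grads = [0.5, -0.25]
L = 8.0
scaled = [g * L for g in true_grads]   # what the backward pass produces

# The update matches the one obtained from the unscaled gradients.
print(sgd_step([1.0, 2.0], scaled, L))
```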

Training iterations repeat with the same loss scaling factor until the next loss scaling factor update step.

FIG. 5 shows how the loss scaling factor affects gradient statistics and prevents overflow. On the left of FIG. 5, a simplified distribution of gradient values is shown. This distribution shape is merely illustrative and is not intended to represent a realistic distribution of gradient values. A threshold T is shown, above which a small number of gradients lie. The same distribution is represented in a quantised form as a histogram of two bins: the first bin h₁ gives a count of all the gradients lying below the threshold T and the second bin h₂ gives a count of all the gradients lying above the threshold T. For the purpose of avoiding overflow, a useful statistic is the proportion of the gradients that are above a value close to the maximum representable value in the given numerical format. For example, for FP16, a threshold may be chosen at half of the maximum number representable by FP16, i.e. at 32752 (half of 65504). The proportion of gradients above this threshold gives an indication of how much clipping is occurring in the network. A minimum proportion f may be set at which it is determined that the loss scaling factor is too high and should be reduced. An example proportion f for FP16 may be chosen to be 10⁻⁶. Note that the histograms in FIG. 5 are not to scale. As mentioned above, different thresholds 422 may be applied at each layer.
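The two-bin histogram and the resulting proportion can be sketched as below. The sketch assumes that gradient magnitudes (absolute values) are compared against T, and the function names and sample gradients are hypothetical.

```python
FP16_MAX = 65504.0
THRESHOLD = FP16_MAX / 2   # 32752, the example threshold for FP16

def two_bin_histogram(grads, threshold=THRESHOLD):
    """Return (h1, h2): counts of magnitudes at or below, and above, the threshold."""
    h2 = sum(1 for g in grads if abs(g) > threshold)
    return len(grads) - h2, h2

def proportion_above(grads, threshold=THRESHOLD):
    """Fraction of gradients whose magnitude exceeds the threshold."""
    h1, h2 = two_bin_histogram(grads, threshold)
    total = h1 + h2
    return h2 / total if total else 0.0

sample = [1.0, -40000.0, 100.0, 50000.0]
print(two_bin_histogram(sample), proportion_above(sample))
```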

Once it is determined that a proportion greater than f of the gradients lie above the threshold T, the loss scaling factor may be reduced by a factor s. This has the effect of shifting the distribution of gradients down, once they are scaled by this factor, such that a smaller proportion of the gradients lie above the threshold T. An algorithm may be applied which either increases the loss scaling factor at every loss scaling factor update step at which the proportion is below the critical fraction f, or only increases the loss scaling factor after a number of consecutive update steps in which the proportion above the threshold is below the critical fraction f.

A gradient histogram may be computed which comprises more than two bins. In the case where the gradient histogram comprises more than two bins, with bin edges {b₁, b₂, . . . , b_(n−1)} and bin counts {h₁, h₂, . . . , h_(n)}, then for a given threshold T, and after M consecutive optimiser steps, the loss scaling factor L is increased only if the proportion of the total count of all bins whose edges are greater than or equal to threshold T does not exceed the user-defined fraction f. That is to say:

$\frac{\sum_{b_{i} \geq T} h_{i}}{\sum_{i} h_{i}} \leq f.$

The loss scaling factor is decreased otherwise.
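A minimal sketch of the multi-bin test follows. It assumes that each bin is identified by its lower edge and that the first bin has a lower edge of zero; both conventions, and the function names and example counts, are assumptions made for illustration.

```python
def proportion_at_or_above(edges, counts, threshold):
    """Share of the total count in bins whose lower edge is >= threshold.

    `edges` holds the n-1 interior bin edges b_1..b_(n-1); `counts` holds the
    n bin counts h_1..h_n. The first bin is assumed to start at zero.
    """
    if len(counts) != len(edges) + 1:
        raise ValueError("expected one more count than interior edges")
    lower_edges = [0.0] + list(edges)
    total = sum(counts)
    if total == 0:
        return 0.0
    above = sum(h for lo, h in zip(lower_edges, counts) if lo >= threshold)
    return above / total

def should_increase(edges, counts, threshold, f):
    """True if the loss scaling factor may be increased; otherwise it is decreased."""
    return proportion_at_or_above(edges, counts, threshold) <= f

edges = [1.0, 10.0, 100.0, 1000.0]
counts = [50, 30, 15, 4, 1]
print(proportion_at_or_above(edges, counts, 100.0))
```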

FIG. 6 shows a flow chart of how a loss scaling factor L may be updated automatically based on gradient statistics computed periodically during training of a deep learning model. Optimisation is carried out over a number of training iterations, with the current iteration given by an optimiser step count. At the start of training, the optimiser count is initialised to zero. A second count, referred to herein as a scaling count, is initialised. This count is used to signal the number of consecutive training iterations for which the gradients satisfy the condition for not reducing the scaling factor. The scaling factor itself is also initialised. For example, the scaling factor may initially be set to 1, such that the gradients are not scaled up or down for the first training iterations, and once gradient statistics are known, the scaling factor is adjusted, as will be described below.

At each training iteration, a first step 602 computes the forward pass, and at step 604 the gradients are computed in a backward pass. The weights are updated at step 606 based on the computed gradients. A check 610 is then done on the current optimiser count to see if the current iteration is a multiple of the number of steps N defining the frequency of computing gradient statistics. If the current optimiser count is not a multiple of N then at step 608 the count is incremented by 1, and the scaling count is also incremented by 1, since no change of the loss scaling factor takes place at this step. If the current training iteration is a multiple of the predetermined number of iterations N defining the frequency at which gradient statistics are computed, then after updating the current optimiser count at step 614, the gradient statistics are computed at step 616. These may be computed as a histogram of gradient values falling into two or more bins, for example. The statistics may be computed for each layer of the network separately and accumulated for the entire network. A condition 618 is then applied to see if the proportion of gradients above a threshold is larger than the critical fraction f, where this critical fraction can be defined by the user. If the proportion of gradients above the threshold is larger than f, then the scaling factor is reduced at step 620 by a factor s, and the scaling count is reset to zero at step 624 to signify that the scaling factor has been updated and this is the first iteration with the new loss scaling factor. In one example, s=2, and the loss scaling factor L is halved. If the proportion of gradients above the threshold is less than or equal to the critical fraction f, then a further check 622 is performed to identify how many steps the proportion has been below the fraction, which is given by the scaling count. If the scaling count is at least M steps, then the loss scaling factor L is increased by a factor s at step 624, and the scaling count is reset at step 626 to signify that this is the first iteration with the new loss scaling factor. In the above example where s=2, the loss scaling factor is therefore doubled for every M steps in which the proportion of gradients above the threshold is less than the critical fraction f. If at step 622 the scaling count is less than M steps, then the scaling count is incremented by one at step 628, and a new training iteration begins with a forward pass 602, without any change in the loss scaling factor. In practice, it may be desirable to adjust the scaling factor up or down at every iteration in which the gradient statistics are computed. This can be achieved by setting M=N. In this case, the step 622 will not be necessary as the current scaling count will always be N=M.
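The bookkeeping of FIG. 6 that follows the weight update might be sketched as below. This is one possible reading of the flow chart, not a definitive implementation: the class `ScalerState`, the function `update_loss_scale` and the callable `measure` (which stands in for the statistics computation and returns the proportion of gradients above the threshold) are hypothetical, and the toy overflow model in the demonstration loop is invented.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class ScalerState:
    loss_scale: float = 1.0
    optimiser_count: int = 0
    scaling_count: int = 0

def update_loss_scale(state: ScalerState, measure: Callable[[], float], *,
                      n: int, m: int, f: float, s: float = 2.0) -> ScalerState:
    """Run the checks that follow the weight update in one training iteration."""
    at_update_step = state.optimiser_count % n == 0
    state.optimiser_count += 1
    if not at_update_step:
        # No statistics this iteration: the loss scaling factor is unchanged.
        state.scaling_count += 1
        return state
    proportion = measure()             # statistics are only gathered every n steps
    if proportion > f:
        state.loss_scale /= s          # too much clipping: reduce the factor
        state.scaling_count = 0
    elif state.scaling_count >= m:
        state.loss_scale *= s          # stable for at least m steps: increase
        state.scaling_count = 0
    else:
        state.scaling_count += 1
    return state

# Toy model: clipping appears once the loss scaling factor exceeds 64.
state = ScalerState()
for _ in range(200):
    update_loss_scale(state, lambda: 1e-3 if state.loss_scale > 64 else 0.0,
                      n=2, m=4, f=1e-6)
print(state.loss_scale)
```

In this toy run the factor ramps up from 1 and then alternates around the point at which clipping begins, which is the behaviour the flow chart is intended to produce.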

Note that the factor s in the present example is applied both to the scaling up and the scaling down of the loss scaling factor. In other implementations different factors s₁ and s₂ may be used to scale up or scale down the loss scaling factor as required. These factors may be constant, or may be adapted over the series of iterations based on the gradient statistics or other factors.

The above-described method of scaling up gradients allows gradients of the network to be stored and processed in computations in a lower-precision representation such as FP8 or FP16, which results in improved computational efficiency when processing gradients and communicating gradients between processors for multi-processor systems. Neural networks may combine the storage of gradients in a low-precision format with higher-precision representations of weights and activations. Alternatively, weights and activations may also be stored in low-precision formats for processing in particular layers, such as layers with convolutions and matrix multiplications. Any subset of activations, weights and gradients of the network may be selected for storage in a low-precision format. References herein to ‘a subset of activations, weights and gradients’ include subsets containing all members of one group, such as all activations or all weights, as well as subsets containing values from different groups, such as all activations and all weights from the first layer.

In addition to scaling the loss, the representation of the gradients in the backward pass may be adjusted by selecting an appropriate exponent bias, which offsets the exponent value in the chosen floating-point format by a fixed amount, which is equivalent to applying a fixed multiplicative factor.

Mathematical details of an example implementation of automatic loss scaling for an L-layer neural network model ℳ will now be provided. This implementation collects two histograms, for gradients with respect to weights and gradients with respect to activations, respectively, and uses an aggregation of the two histograms to determine whether to increase or decrease the scale factor.

The loss estimated over a micro-batch (i.e. a small subset of the overall training data) of size B is given by:

$\mathcal{R} = \frac{1}{B}\sum_{i=0}^{B-1} \mathcal{L}\left(\mathcal{M} \mid \Theta, x_{i}\right) = \frac{1}{B}\sum_{i=0}^{B-1} \mathcal{L}\left(g_{L-1} \circ g_{L-2} \circ \cdots \circ g_{0}(x_{i}) \mid \Theta\right),$

where each layer l, for 0≤l<L, with parameters θ_(l), such that θ_(l)=∅ if the layer is parameterless, is defined as a mapping g_(l) of its input, with model parameters Θ=∪_(l=0)^(L−1)θ_(l), and where x is the input to the network. The composition of the first l+1 layers in the model is denoted by

_(l)=g_(l)·g_(l−1)· . . . g₀.

The following mathematical description is generally applicable to different configurations of neural network models, for example different optimisers, hyperparameter values, etc. In the present example, a single histogram ℋ_(GW) is used for collecting statistics of weight gradients (i.e. gradients of the loss function with respect to weights of the network), and a second single histogram ℋ_(GX) is used for collecting statistics of activation gradients (i.e. gradients of the loss function with respect to activations of the network). In this example implementation, the histograms are defined over the FP16 range, having as bin edges all exponents in the range [−24:15], although other ranges can be used in association with other floating-point formats. For each gradient type, a subset 0≤L′≤L of the network layers is used for statistics gathering, such that at least one histogram is available.

Two alternative methods can be used in the present method to combine the bin counts from both histograms. The first method combines the histogram bin values of the two histograms and determines whether the total proportion of bin values exceeding a cut-off bin C, which defines a threshold T, is below the critical fraction, increasing the loss scaling factor if the following condition is satisfied:

$\frac{\sum_{i \geq C} \left(\mathcal{H}_{GX}[i] + \mathcal{H}_{GW}[i]\right)}{\sum_{i < C} \left(\mathcal{H}_{GX}[i] + \mathcal{H}_{GW}[i]\right)} < f,$

where f is the critical fraction. Otherwise the scaling factor is reduced. Excluding the underflow count, this condition is written:

$\frac{\sum_{i \geq C} \left(\mathcal{H}_{GX}[i] + \mathcal{H}_{GW}[i]\right)}{\sum_{0 < i < C} \left(\mathcal{H}_{GX}[i] + \mathcal{H}_{GW}[i]\right)} < f.$

The second method compares the proportion of the bin count exceeding a respective cutoff C to a respective critical fraction f separately for each histogram, and a joint decision to increase the loss scaling factor is only made if both tests pass (i.e. a unanimous vote). Critical fractions f_(GX) and f_(GW) and cutoff bins C_(GX) and C_(GW) are assumed for activation and weight gradients, respectively. The loss scaling factor is increased if the following condition is met:

$\left(\frac{\sum_{i \geq C_{GX}} \mathcal{H}_{GX}[i]}{\sum_{i < C_{GX}} \mathcal{H}_{GX}[i]} < f_{GX}\right) \wedge \left(\frac{\sum_{i \geq C_{GW}} \mathcal{H}_{GW}[i]}{\sum_{i < C_{GW}} \mathcal{H}_{GW}[i]} < f_{GW}\right),$

which is written as follows where underflow counts are excluded:

$\left(\frac{\sum_{i \geq C_{GX}} \mathcal{H}_{GX}[i]}{\sum_{0 < i < C_{GX}} \mathcal{H}_{GX}[i]} < f_{GX}\right) \wedge \left(\frac{\sum_{i \geq C_{GW}} \mathcal{H}_{GW}[i]}{\sum_{0 < i < C_{GW}} \mathcal{H}_{GW}[i]} < f_{GW}\right).$
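The two ways of combining the histograms can be compared with a short sketch. It assumes that histograms are lists of bin counts indexed by bin number, that bin 0 is the underflow bin, and that each ratio test is evaluated in the cross-multiplied form `above < f * below` to avoid division by zero. All function names and example counts are hypothetical.

```python
def tail_counts(hist, cutoff, exclude_underflow=False):
    """Return (count at or above the cutoff bin, count below it)."""
    start = 1 if exclude_underflow else 0   # bin 0 is assumed to hold underflow
    return sum(hist[cutoff:]), sum(hist[start:cutoff])

def sum_vote(h_gx, h_gw, cutoff, f, exclude_underflow=False):
    """First method: pool both histograms, then apply a single test."""
    ax, bx = tail_counts(h_gx, cutoff, exclude_underflow)
    aw, bw = tail_counts(h_gw, cutoff, exclude_underflow)
    return (ax + aw) < f * (bx + bw)

def unanimous_vote(h_gx, h_gw, c_gx, c_gw, f_gx, f_gw, exclude_underflow=False):
    """Second method: each histogram must pass its own test."""
    ax, bx = tail_counts(h_gx, c_gx, exclude_underflow)
    aw, bw = tail_counts(h_gw, c_gw, exclude_underflow)
    return ax < f_gx * bx and aw < f_gw * bw

h_gx = [100, 500, 400, 0]   # many activation gradients, none near the top
h_gw = [0, 10, 80, 10]      # fewer weight gradients, some near the top

print(sum_vote(h_gx, h_gw, cutoff=3, f=0.05))
print(unanimous_vote(h_gx, h_gw, 3, 3, 0.05, 0.05))
```

With these invented counts the pooled test permits an increase while the unanimous vote does not, because the large activation gradient count dilutes the weight gradient tail in the pooled histogram.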

Activation gradients computed for a micro-batch of training data are dependent on the size of the micro-batch. As the batch size increases, the activation gradients become smaller. For the histogram of bin counts, doubling the batch size halves all the activation gradients, having the effect of shifting the histogram ℋ_(GX) down by one exponent bin, leading to greater underflow.

By contrast, weight gradients are computed as an average of the per-micro-batch-element weight gradient estimates, which means that, in expectation, the weight gradient estimates will not change when the batch size increases. Therefore, the histogram of weight gradient estimates ℋ_(GW) is unchanged by a changing batch size. The performance of the automatic loss scaling therefore depends on the micro-batch size. If the weight and activation gradient statistics are combined by summing their bin counts as described above, then the contribution from the activation gradients to the collected statistics varies depending on the micro-batch size, both in terms of quantity and bin position. Furthermore, the ratio of the weight gradients count to the activation gradients count in the combined histogram is inversely proportional to the batch size, which means that for a higher batch size the information from the weight gradients will be more diluted, making it hard to create a robust implementation of the automatic loss scaling (ALS) algorithm.

To resolve this, the ALS algorithm can be constructed such that the ratio of the weight gradients count to the activation gradients count depends only on the model definition and remains constant irrespective of the micro-batch size being used. Such a ratio is denoted as ρ(ℳ). Taking B=1 as a reference, where B is the size of the micro-batch, the activation gradients histogram is estimated per micro-batch element. This can be done by scaling the gradients by the batch size B before gathering statistics. This does not recover any amount of underflow and comes with the additional cost of scaling the activation gradient tensors. Since automatic loss scaling uses cutoff bin edges, for a given ratio ρ(ℳ), the weight gradient histograms are scaled by B (since activation gradient counts scale with batch size) and the activation gradients cutoff bin edge is reduced each time the batch size is increased. A sum-based condition for increasing the loss scaling factor according to this method can be written as:

$\frac{\sum_{i \geq C - [\log_{2} B]} \mathcal{H}_{GX}[i] + B \sum_{i \geq C} \mathcal{H}_{GW}[i]}{\sum_{i < C - [\log_{2} B]} \mathcal{H}_{GX}[i] + B \sum_{i < C} \mathcal{H}_{GW}[i]} < f,$

or, excluding the underflow count:

$\frac{\sum_{i \geq C - [\log_{2} B]} \mathcal{H}_{GX}[i] + B \sum_{i \geq C} \mathcal{H}_{GW}[i]}{\sum_{0 < i < C - [\log_{2} B]} \mathcal{H}_{GX}[i] + B \sum_{0 < i < C} \mathcal{H}_{GW}[i]} < f.$

A condition according to the ‘unanimous vote’ approach as described above can be written as:

$\left(\frac{\sum_{i \geq C_{GX} - [\log_{2} B]} \mathcal{H}_{GX}[i]}{\sum_{i < C_{GX} - [\log_{2} B]} \mathcal{H}_{GX}[i]} < f_{GX}\right) \wedge \left(\frac{\sum_{i \geq C_{GW}} \mathcal{H}_{GW}[i]}{\sum_{i < C_{GW}} \mathcal{H}_{GW}[i]} < f_{GW}\right),$

or, excluding the underflow count:

$\left(\frac{\sum_{i \geq C_{GX} - [\log_{2} B]} \mathcal{H}_{GX}[i]}{\sum_{0 < i < C_{GX} - [\log_{2} B]} \mathcal{H}_{GX}[i]} < f_{GX}\right) \wedge \left(\frac{\sum_{i \geq C_{GW}} \mathcal{H}_{GW}[i]}{\sum_{0 < i < C_{GW}} \mathcal{H}_{GW}[i]} < f_{GW}\right).$
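A sketch of the batch-size-aware sum condition is given below. It follows the surrounding description in assuming that the weight gradient histogram is the one weighted by B while the activation gradient cutoff is lowered by log₂B bins. To sidestep the question of how log₂B is rounded, the sketch only accepts batch sizes that are powers of two. Bin 0 is assumed to be the underflow bin, and all names and example counts are hypothetical.

```python
def batch_shift(batch_size: int) -> int:
    """log2 of the batch size, restricted to exact powers of two."""
    if batch_size < 1 or batch_size & (batch_size - 1):
        raise ValueError("batch size must be a positive power of two")
    return batch_size.bit_length() - 1

def sum_vote_batch(h_gx, h_gw, cutoff, f, batch_size, exclude_underflow=False):
    """Pooled test with a shifted activation cutoff and B-weighted weight counts."""
    start = 1 if exclude_underflow else 0
    c_gx = cutoff - batch_shift(batch_size)   # lowered cutoff for activations
    if c_gx < start:
        raise ValueError("shifted cutoff falls below the first usable bin")
    above = sum(h_gx[c_gx:]) + batch_size * sum(h_gw[cutoff:])
    below = sum(h_gx[start:c_gx]) + batch_size * sum(h_gw[start:cutoff])
    return above < f * below

h_gx = [0, 10, 20, 5, 1, 0]
h_gw = [0, 1, 2, 3, 1, 0]
print(sum_vote_batch(h_gx, h_gw, cutoff=4, f=0.2, batch_size=2))
```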

As batch size increases, so does underflow of activation gradients, meaning that increasing the loss scaling to improve the representation of activation gradients causes a loss of representation at the upper end of the range, which results in gradient clipping of the weight gradients. Furthermore, as training evolves, and as the histograms diverge due to values becoming smaller or larger, the decision to reduce or increase the loss scaling factor is dominated by the statistics collection that exceeds the cutoff bin edges count threshold faster.

Alternatively, to balance the underflow of activation gradients against the saturation of weight gradients, the ALS algorithm can be designed to work with different scaling factors, α_(GX) and α_(GW), for activation gradients and weight gradients respectively. The loss is scaled by α_(GX), while the result of the weight gradients calculation for a given layer is scaled by α_(GW)/α_(GX) to reflect the desired weight gradients scaling. The difference between the two scaling factors can be fixed to the micro-batch size or can be dynamically set based on the tensor statistics. Scaling based on tensor statistics allows maximisation of the use of the available dynamic range, in particular as the gradients diverge during training.
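The arithmetic of the two scaling factors can be checked with a small sketch. The numeric values are invented, powers of two are chosen so that the floating-point arithmetic is exact, and `rescale_weight_grad` is a hypothetical helper.

```python
def rescale_weight_grad(weight_grad_at_gx_scale, alpha_gx, alpha_gw):
    """Convert a weight gradient carrying scale alpha_gx to scale alpha_gw."""
    return weight_grad_at_gx_scale * (alpha_gw / alpha_gx)

alpha_gx = 1024.0   # applied to the loss, so activation gradients carry this scale
alpha_gw = 64.0     # desired scale for weight gradients

true_weight_grad = 0.25
from_backward_pass = true_weight_grad * alpha_gx   # inherits the loss scale
stored = rescale_weight_grad(from_backward_pass, alpha_gx, alpha_gw)

# Dividing by alpha_gw before the weight update recovers the true gradient.
print(stored, stored / alpha_gw)
```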

Furthermore, different loss scaling factors can be computed for different layers, or different blocks of the neural network. Statistics are gathered based on the chosen layer granularity, and these statistics are used to update the scaling factors for the scope's activation and weight gradients during the backward pass computation.

It should be noted that, while the above description relates to activation and weight gradients, the described techniques can be applied to determine a scale factor for any two quantities whose distributions diverge.

Adaptive Format Selection for the Forward Pass

In order to improve the accuracy of the quantization of weights and activations in the forward pass, a similar principle to the scaling factor described above may be applied to determine a representation for a set of values based on their statistics. In the forward pass, statistics of weights and activations are collected in order to maintain separate histograms of the weights and activations; the fraction of the total number of samples of each histogram that are above a given threshold is measured; and the exponent offset (or exponent bias) is adjusted accordingly to maintain a predefined fraction of samples above the given threshold. In general, the histograms will comprise a plurality of bins.

Adjusting the exponent bias of weights and activations shifts the representable range of values for these weights and activations. If the exponent bias or offset is increased, the range of representable values is shifted to lower magnitudes.
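The shift in representable range can be illustrated with a sketch that assumes IEEE-style conventions: the highest exponent code is reserved, and a normal value with exponent code c has magnitude between 2^(c−bias) and just under 2^(c−bias+1). These conventions, and the helper `normal_range`, are assumptions for illustration; a particular 8-bit format may differ.

```python
def normal_range(exp_bits: int, man_bits: int, bias: int):
    """Smallest and largest normal magnitudes under assumed IEEE-style conventions."""
    max_code = 2 ** exp_bits - 2          # top exponent code assumed reserved
    largest = (2.0 - 2.0 ** -man_bits) * 2.0 ** (max_code - bias)
    smallest = 2.0 ** (1 - bias)
    return smallest, largest

# FP16 (5 exponent bits, 10 mantissa bits, bias 15) as a sanity check.
print(normal_range(5, 10, 15))
# Increasing the bias by one halves both ends of the range.
print(normal_range(5, 10, 16))
```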

In the forward pass, histograms are collected for activations, gradients with respect to weights and gradients with respect to outputs, where the goal is to determine an appropriate format for representing these values. As described above for gradients, histograms have at least two bins, with the histogram providing an aggregation of all values falling within the ranges indicated by each bin. In general, histograms will comprise more than two bins.

Histogram bins may be selected based on the format of the values being collected. For example, where the weights, activations and/or gradients are being converted from an FP16 format to an FP8 format, then the bin edges are selected at each power of two in the range of FP16 values. This includes exponents between the values of −24 and 15. If converting from an FP32 format to FP8, then bin edges for the range of values of FP32 would be chosen instead.
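A histogram with bin edges at each power of two over the FP16 exponent range might be built as follows. The layout (an extra bin 0 for zero and underflow, and clamping of larger values into the top bin) is an assumption made for this sketch, as is the function name.

```python
import math

def exponent_histogram(values, min_exp=-24, max_exp=15):
    """Count finite values per power-of-two bin.

    Bin 0 collects magnitudes below 2**min_exp (including zero); bin k >= 1
    collects magnitudes in [2**(min_exp + k - 1), 2**(min_exp + k)).
    Magnitudes at or above 2**(max_exp + 1) are clamped into the top bin.
    """
    counts = [0] * (max_exp - min_exp + 2)
    for v in values:
        magnitude = abs(v)
        if magnitude < 2.0 ** min_exp:
            counts[0] += 1
            continue
        exponent = math.frexp(magnitude)[1] - 1   # floor(log2(magnitude))
        exponent = min(exponent, max_exp)
        counts[exponent - min_exp + 1] += 1
    return counts

hist = exponent_histogram([0.0, 1.0, 1.5, 2.0, 2.0 ** -24, 1e6])
print(len(hist), hist[25], hist[26])
```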

Histograms may be collected for a single layer of the network, or a single histogram can be collected with aggregate statistics for values of multiple layers, assuming that the layers combined in the aggregation use the same format for the relevant values. Sets of layers may be selected heuristically, or may be determined automatically by the computer program.

Training and implementation of a neural network model may be performed on a set of multiple processors, with each processor processing a subset of the data, for example with each processor handling a mini-batch of data, and each processor having a local replica of the neural network model. Each processor may compute its own histogram with the gradients for the respective subset of the data. Histograms may be communicated to other processors at the end of mini-batch computation in order to determine aggregate statistics based on which a common representation for gradients and/or activations and weights may be determined and applied when converting said values to a particular format. Bin counts may be represented in the form of a raw count, or as a proportion by dividing the counts by the sum of all bins of the aggregated histogram. Communication overhead for sending histograms to other processors must be balanced against the computational advantage of having an optimal scaling factor when deciding on the frequency at which statistics need to be computed and communicated.
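The aggregation of per-processor histograms can be sketched as an element-wise sum over a common set of bins, optionally followed by normalisation into proportions. The communication itself is omitted, and the function names and counts are hypothetical.

```python
def aggregate(histograms):
    """Element-wise sum of per-processor histograms that share the same bins."""
    if not histograms:
        raise ValueError("need at least one histogram")
    n_bins = len(histograms[0])
    if any(len(h) != n_bins for h in histograms):
        raise ValueError("all histograms must use the same set of bins")
    return [sum(column) for column in zip(*histograms)]

def normalise(counts):
    """Convert raw bin counts to proportions of the aggregated total."""
    total = sum(counts)
    return [c / total for c in counts] if total else [0.0] * len(counts)

per_processor = [[1, 2, 3], [0, 1, 1]]
combined = aggregate(per_processor)
print(combined, normalise(combined))
```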

As described above for gradients, one criterion on which to determine an appropriate representation of a value based on the collected histogram is to apply a threshold, and to reduce the scaling factor applied in response to a number, or more typically a proportion, of values in the histogram exceeding the chosen threshold, indicating a degree of overflow. Other criteria may be determined for the collected values and used to adjust the representation of the values in the next stage of training. Some examples of such criteria include mean-square error, signal-to-noise ratio, degree of underflow, and Kullback-Leibler divergence.

In addition to selecting the bias for the exponent of weights and activations which are to be expressed in a floating-point format such as FP8, the statistics may be used to select the type of format to be used. The spread of values in the histograms may indicate the most appropriate format for representing the values. As mentioned above, floating-point formats may allocate different numbers of bits to represent the mantissa and the exponent, which provide different representable ranges and different numerical precision within those ranges. For an 8-bit floating-point format, two possible formats are 1.4.3, which uses one sign bit, e=4 exponent bits and m=3 mantissa bits, and 1.5.2, which uses one sign bit, e=5 exponent bits and m=2 mantissa bits. By collecting histograms in a forward pass, it is possible to analyse the range of values that need to be represented and select between different formats according to the range. In general, it is desirable to choose the format that represents values with as high a numerical precision as possible and for which most of the values within the range can be represented. An appropriate choice of exponent bias may be determined for each of multiple candidate formats, for example an exponent bias can be determined for both of 1.4.3 and 1.5.2. The format may be selected from the set of candidate formats using the same or different criteria as those described above for selection of the exponent bias. In the event of more than one format having the same performance according to the given criterion, the format with the smallest exponent field size may be chosen, as this maximises precision by allowing more mantissa bits to represent the given number.
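One way the format selection could work is sketched below, using coverage of an exponent histogram as the criterion. The sketch assumes that a format with e exponent bits can cover 2^e consecutive power-of-two bins (ignoring reserved exponent codes and subnormals), that the exponent bias corresponds to the position of that window, and that ties are broken in favour of the smaller exponent field. The names `best_window`, `select_format` and `CANDIDATES` are hypothetical.

```python
CANDIDATES = {"1.4.3": (4, 3), "1.5.2": (5, 2)}   # name -> (exponent bits, mantissa bits)

def best_window(hist, width):
    """Return (start bin, covered count) of the best window of `width` bins."""
    width = min(width, len(hist))
    current = sum(hist[:width])
    best_start, best = 0, current
    for start in range(1, len(hist) - width + 1):
        current += hist[start + width - 1] - hist[start - 1]
        if current > best:
            best_start, best = start, current
    return best_start, best

def select_format(hist, candidates=CANDIDATES):
    """Pick the format covering the most samples; ties go to fewer exponent bits."""
    choice = None
    for name, (exp_bits, _man_bits) in candidates.items():
        start, covered = best_window(hist, 2 ** exp_bits)
        key = (covered, -exp_bits)
        if choice is None or key > choice[0]:
            choice = (key, name, start)
    return choice[1], choice[2]   # format name and first covered bin

narrow = [0] * 10 + [5] * 10 + [0] * 20   # values spread over 10 exponent bins
wide = [1] * 40                           # values spread over 40 exponent bins
print(select_format(narrow), select_format(wide))
```

The returned start bin indicates where the representable window sits, which is the information from which an exponent bias would be derived for the chosen format.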

Once a format is determined, i.e. a scaling factor, exponent bias, and/or appropriate allocation of bits to represent the exponent and mantissa, these can be applied to the respective values for subsequent steps of training the neural network. These may be applied only to a subset of layers of the network, for example those in which matrix multiplications and convolutions occur, as these layers are compute-intensive and a lower-precision format is most effective in improving the efficiency of such operations. The representation is applied when the given values are converted to the new format, for example when converting weights or activations to FP8 before performing a convolution operation.

A first aspect disclosed herein provides a computer-implemented method of training, based on a set of training data, a multi-layer neural network comprising a set of network weights, the method comprising: processing the training data in respective forward and backward passes through a sequence of layers of the network, the forward pass comprising computing a set of activations by applying an activation function in dependence on the network weights and training data, and the backward pass comprising: computing gradients of a pre-determined loss function with respect to the network weights and/or computing gradients of the pre-determined loss function with respect to the computed activations of the network, wherein an adjustment parameter is applied to at least a subset of values in the neural network, the values comprising at least one of: the network weights, the activations computed in the forward pass, the gradients with respect to activations computed in the backward pass, and the gradients with respect to weights computed in the backward pass; updating the network weights in dependence on the computed gradients with respect to the weights; computing a proportion of the subset of values falling above a predefined threshold; and updating the adjustment parameter applied to the subset of machine learning parameters in dependence on the computed proportion.

In embodiments, the adjustment parameter is a scale factor, and wherein the scale factor is applied on the backward pass to at least a subset of the gradients with respect to the activations and/or the gradients with respect to the network weights, wherein the scale factor is updated in dependence on the proportion of the gradients of that subset that have a value falling above a pre-defined threshold.

In embodiments, the adjustment parameter is a scale factor, and thescale factor is applied on the backward pass to at least a subset of thegradients with respect to at least one of the activations and thegradients with respect to the network weights, wherein the scale factoris updated in dependence on the proportion of the gradients of thatsubset that have a value falling above a pre-defined threshold.
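A minimal sketch of such an update is given below for illustration. The helper names `proportion_above` and `update_scale`, the comparison on magnitudes, the tolerated proportion of 0.001 and the halving/doubling rule are all assumptions of this sketch; the passage above requires only that the scale factor be updated in dependence on the computed proportion.

```python
def proportion_above(values, threshold):
    """Fraction of values whose magnitude exceeds the threshold."""
    if not values:
        return 0.0
    return sum(1 for v in values if abs(v) > threshold) / len(values)

def update_scale(scale, proportion, max_proportion=0.001, factor=2.0):
    """Assumed rule: shrink the scale when too many gradients are large."""
    if proportion > max_proportion:
        return scale / factor
    return scale * factor

# Hypothetical scaled gradients from one backward pass.
scaled_grads = [0.1, -5.0, 3.0, 0.2]
p = proportion_above(scaled_grads, threshold=1.0)
new_scale = update_scale(1024.0, p)
print(p, new_scale)
```

Here half of the gradients exceed the threshold, so the sketch halves the scale factor for the next training step.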

In embodiments, the method comprises applying the scale factor to at least one of gradients with respect to weights and gradients with respect to activations of all layers of the network by multiplying the loss function by the scale factor.
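The effect of multiplying the loss by the scale factor follows from the chain rule: every gradient computed in the backward pass is multiplied by the same factor. The sketch below illustrates this on a hypothetical two-weight scalar network with a ReLU activation and a squared-error loss; the network, the function name `forward_backward` and the numerical values are assumptions chosen only for illustration.

```python
def forward_backward(w1, w2, x, y, loss_scale=1.0):
    """Forward and backward pass of a tiny scalar network (illustrative only)."""
    # Forward pass.
    h = w1 * x
    a = max(0.0, h)                 # ReLU activation
    out = w2 * a
    loss = 0.5 * (out - y) ** 2
    # Backward pass, seeded with d(loss_scale * loss)/d(loss) = loss_scale.
    d_out = loss_scale * (out - y)
    grad_w2 = d_out * a             # gradient with respect to weight w2
    d_a = d_out * w2                # gradient with respect to activation a
    d_h = d_a if h > 0.0 else 0.0
    grad_w1 = d_h * x               # gradient with respect to weight w1
    return loss, {"w1": grad_w1, "w2": grad_w2, "a": d_a}

_, plain = forward_backward(0.5, 3.0, 2.0, 1.0)
_, scaled = forward_backward(0.5, 3.0, 2.0, 1.0, loss_scale=4.0)
# Dividing by the scale factor recovers the unscaled gradients,
# which would be done before the weight update.
unscaled = {name: g / 4.0 for name, g in scaled.items()}
print(plain, scaled, unscaled)
```

Every entry of `scaled` is four times the corresponding entry of `plain`, for the weight gradients and the activation gradient alike.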

In embodiments, the method comprises constructing a histogram of gradients, the histogram comprising a plurality of bins, wherein the scale factor is updated based on a proportion of gradients occupying bins above a threshold value.
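One possible way of constructing such a histogram is sketched below. Binning by gradient magnitude, the particular bin edges and the helper names `gradient_histogram` and `proportion_in_bins_from` are assumptions of this sketch rather than features taken from the disclosure.

```python
import bisect

def gradient_histogram(grads, edges):
    """Count gradient magnitudes per bin.

    Bin 0 holds |g| < edges[0], bin i holds edges[i-1] <= |g| < edges[i],
    and the last bin holds |g| >= edges[-1].
    """
    counts = [0] * (len(edges) + 1)
    for g in grads:
        counts[bisect.bisect_right(edges, abs(g))] += 1
    return counts

def proportion_in_bins_from(counts, first_bin):
    """Proportion of gradients occupying bins first_bin and above."""
    total = sum(counts)
    return sum(counts[first_bin:]) / total if total else 0.0

edges = [0.25, 1.0, 4.0]                       # hypothetical bin edges
counts = gradient_histogram([0.1, -0.5, 2.0, -8.0, 0.3], edges)
p = proportion_in_bins_from(counts, first_bin=2)
print(counts, p)
```

The resulting proportion can then drive the scale factor update, so that only the bin counts, rather than every individual gradient, need to be retained.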

In embodiments, the method comprises constructing a respective histogram of gradients for each layer of the neural network, wherein the proportion of gradients occupying each of a set of bins for each histogram is input to an accumulator to obtain an aggregated proportion for each bin, the scale factor being derived by computing an aggregated proportion occupying bins above an overall threshold.
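The accumulation across layers may be sketched as follows. Weighting each layer's proportions by the number of gradients in that layer is an assumption of this sketch (the passage above does not specify the weighting), as are the function name `aggregate_proportions` and the example values.

```python
def aggregate_proportions(layer_proportions, layer_sizes):
    """Accumulate per-layer bin proportions into one proportion per bin.

    Assumption: each layer is weighted by its number of gradients, so the
    result equals the proportions of a histogram over all layers together.
    """
    n_bins = len(layer_proportions[0])
    accumulator = [0.0] * n_bins
    for proportions, size in zip(layer_proportions, layer_sizes):
        for i, p in enumerate(proportions):
            accumulator[i] += p * size
    total = sum(layer_sizes)
    return [acc / total for acc in accumulator]

# Two hypothetical layers sharing a common set of three bins.
per_layer = [[0.5, 0.5, 0.0], [0.0, 0.5, 0.5]]
sizes = [4, 12]
aggregated = aggregate_proportions(per_layer, sizes)
above_overall_threshold = sum(aggregated[2:])   # bins from index 2 upwards
print(aggregated, above_overall_threshold)
```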

In embodiments, the method comprises constructing a respective histogram of gradients for each layer, wherein for each layer a respective layer-wise scale factor is applied during the backward pass, the layer-wise scale factor being updated based on a proportion of gradients in the histogram for the corresponding layer occupying bins above a corresponding layer-wise threshold value.
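A layer-wise variant may be sketched as below. The layer names, the tolerated proportion and the rule of halving a layer's scale factor (and otherwise leaving it unchanged) are assumptions for illustration; a rule for increasing the scale factor is omitted for brevity.

```python
def update_layer_scales(scales, histograms, threshold_bins,
                        max_proportion=0.01, factor=2.0):
    """Update one scale factor per layer from that layer's histogram."""
    updated = {}
    for layer, scale in scales.items():
        counts = histograms[layer]
        total = sum(counts)
        first_bin = threshold_bins[layer]       # layer-wise threshold bin
        p = sum(counts[first_bin:]) / total if total else 0.0
        # Assumed rule: halve the scale when too many gradients are large.
        updated[layer] = scale / factor if p > max_proportion else scale
    return updated

scales = {"dense1": 256.0, "dense2": 256.0}            # hypothetical layers
histograms = {"dense1": [90, 10, 0], "dense2": [50, 30, 20]}
threshold_bins = {"dense1": 2, "dense2": 1}
new_scales = update_layer_scales(scales, histograms, threshold_bins)
print(new_scales)
```

In this example only the second layer has gradients in bins at or above its threshold, so only its scale factor is reduced.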

In embodiments, the method is implemented on a plurality of processors, wherein each processor processes a respective subset of the training data in each of the forward and backward passes, and computes a respective histogram of gradients for the corresponding subset of the training data, each histogram having defined a common set of bins, wherein the proportion of gradients occupying each bin of the set of bins defined for each histogram is aggregated to obtain an aggregated proportion for each bin, with a scale factor being derived by computing an aggregated proportion occupying bins above an overall threshold.

In embodiments, the method comprises storing at least a subset of the network weights, gradients and activations in computer memory in floating-point format.

In embodiments, the method comprises storing at least a subset of the network weights, gradients and activations in computer memory in eight-bit floating-point format.

In embodiments, the method comprises storing at least a subset of the network weights, gradients and activations in computer memory in sixteen-bit floating-point format.

In embodiments, the method comprises storing the subset of values in a floating-point format, and wherein the adjustment parameter is an exponent bias applied to the floating-point representations of the subset of weights, gradients and activations.

In embodiments, the subset of values in the neural network is a subset of network weights and activations and the adjustment parameter is an exponent bias applied to the subset of values of the network weights and activations in the forward pass.
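Where the adjustment parameter is an exponent bias, the update may be sketched as follows. The IEEE-style reservation of the top exponent code, the threshold set at half of the largest representable value, the two tolerance levels `low` and `high`, and the step of one per update are assumptions of this sketch and are not taken from the disclosure.

```python
def max_representable(exp_bits, man_bits, bias):
    """Largest finite value, assuming the top exponent code is reserved."""
    return (2.0 - 2.0 ** -man_bits) * 2.0 ** ((2 ** exp_bits - 2) - bias)

def update_bias(values, exp_bits, man_bits, bias,
                fraction=0.5, low=1e-4, high=1e-2):
    """Adjust the exponent bias from the proportion of large values."""
    threshold = fraction * max_representable(exp_bits, man_bits, bias)
    p = sum(1 for v in values if abs(v) > threshold) / len(values)
    if p > high:
        return bias - 1     # too many large values: extend the range upwards
    if p < low:
        return bias + 1     # hardly any large values: favour small magnitudes
    return bias

# Hypothetical activations from a forward pass, four exponent bits, bias 7.
print(max_representable(4, 3, 7))
print(update_bias([0.5, 100.0, 200.0, 3.0], 4, 3, 7))
print(update_bias([0.01, 0.02], 4, 3, 7))
```

Lowering the bias by one doubles the largest representable value in this sketch, at the cost of losing the smallest magnitudes.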

In embodiments, a subset of network weights, activations and gradients which are inputs to compute operations in at least one of the forward and backward passes are stored in eight-bit floating-point format, the compute operations comprising at least one of a matrix operation and a convolution operation.

A second aspect herein provides a computer system comprising one or more processors configured to train a multi-layer neural network comprising a set of network weights, and memory holding the network weights, the processor configured to train the neural network by:

-   receiving a set of training data;
-   processing the training data in respective forward and backward passes through a sequence of layers of the network, the forward pass comprising computing a set of activations by applying an activation function in dependence on the network weights and training data, and the backward pass comprising determining a set of gradients of a pre-determined loss function with respect to the weights and/or activations of the network, wherein an adjustment parameter is applied to at least a subset of values in the neural network, wherein the values on the forward pass comprise at least one of the network weights and computed activations, and the values on the backwards pass comprise the computed gradients with respect to activations and gradients with respect to weights;
-   storing the values to memory;
-   updating the network weights in dependence on the computed gradients with respect to the weights;
-   on at least one of the forward and backward pass, computing a proportion of the subset of values falling above a predefined threshold; and
-   updating the adjustment parameter applied to the subset of machine learning parameters in dependence on the computed proportion.

In embodiments, the computer system comprises a plurality of processors, wherein each processor is configured to process a respective subset of the training data.

In embodiments, the adjustment parameter is updated in dependence on an aggregated proportion of values for all processors falling above a predefined threshold, the aggregated proportion computed by aggregating a computed proportion of the subset of values falling above the predefined threshold for each of the plurality of processors.
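The aggregation across processors may be sketched as below. A plain loop stands in for whatever collective operation exchanges values between processors, and taking the mean of the per-processor proportions is an assumption of this sketch; it equals the overall proportion when every processor holds the same number of values.

```python
def local_proportion(values, threshold):
    """Proportion computed independently on one processor."""
    return sum(1 for v in values if abs(v) > threshold) / len(values)

def aggregated_proportion(local_proportions):
    """Assumed aggregation: mean of the per-processor proportions."""
    return sum(local_proportions) / len(local_proportions)

# Hypothetical gradients held by three processors, each with its own data shard.
shards = [[0.1, 2.0], [3.0, 4.0], [0.2, 0.3]]
local = [local_proportion(shard, threshold=1.0) for shard in shards]
overall = aggregated_proportion(local)
print(local, overall)
```

Each processor then applies the same update to the adjustment parameter, since all of them see the same aggregated proportion.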

A further aspect of the present disclosure provides a non-transitory computer-readable storage medium storing computer program instructions which when executed perform a method of training, based on a set of training data, a multi-layer neural network comprising a set of network weights, the method comprising:

-   processing the training data in respective forward and backward passes through a sequence of layers of the network, the forward pass comprising computing a set of activations by applying an activation function in dependence on the network weights and training data, and the backward pass comprising determining a set of gradients of a pre-determined loss function with respect to the weights and/or activations of the network, wherein an adjustment parameter is applied to at least a subset of values in the neural network, and wherein the values on the forward pass comprise at least one of the network weights and computed activations, and the values on the backwards pass comprise the computed gradients with respect to activations and gradients with respect to weights;
-   updating the network weights in dependence on the computed gradients with respect to the weights;
-   on at least one of the forward and backward pass, computing a proportion of the subset of values falling above a predefined threshold; and
-   updating the adjustment parameter applied to the subset of machine learning parameters in dependence on the computed proportion.

1. A computer-implemented method of training, based on a set of training data, a multi-layer neural network comprising a set of network weights, the method comprising: processing the training data in respective forward and backward passes through a sequence of layers of the network, the forward pass comprising computing a set of activations by applying an activation function in dependence on the network weights and training data, and the backward pass comprising: computing gradients of a pre-determined loss function with respect to the network weights and/or computing gradients of the pre-determined loss function with respect to the computed activations of the network, wherein an adjustment parameter is applied to at least a subset of values in the neural network, the values comprising at least one of: the network weights, the activations computed in the forward pass, the gradients with respect to activations computed in the backward pass, and the gradients with respect to weights computed in the backward pass; updating the network weights in dependence on the computed gradients with respect to the weights; computing a proportion of the subset of values falling above a predefined threshold; and updating the adjustment parameter applied to the subset of machine learning parameters in dependence on the computed proportion.
2. The method of claim 1, wherein the adjustment parameter is a scale factor, and wherein the scale factor is applied on the backward pass to at least a subset of the gradients with respect to at least one of the activations and the gradients with respect to the network weights, wherein the scale factor is updated in dependence on the proportion of the gradients of that subset that have a value falling above a pre-defined threshold.
3. The method of claim 2, comprising applying the scale factor to at least one of gradients with respect to weights and gradients with respect to activations of all layers of the network by multiplying the loss function by the scale factor.
4. The method of claim 2, comprising constructing a histogram of gradients, the histogram comprising a plurality of bins, wherein the scale factor is updated based on a proportion of gradients occupying bins above a threshold value.
5. The method of claim 4, comprising constructing a respective histogram of gradients for each layer of the neural network, wherein the proportion of gradients occupying each of a set of bins for each histogram is input to an accumulator to obtain an aggregated proportion for each bin, the scale factor being derived by computing an aggregated proportion occupying bins above an overall threshold.
6. The method of claim 4, comprising constructing a respective histogram of gradients for each layer, wherein for each layer a respective layer-wise scale factor is applied during the backward pass, the layer-wise scale factor being updated based on a proportion of gradients in the histogram for the corresponding layer occupying bins above a corresponding layer-wise threshold value.
7. The method of claim 4 when implemented on a plurality of processors, wherein each processor processes a respective subset of the training data in each of the forward and backward passes, and computes a respective histogram of gradients for the corresponding subset of the training data, each histogram having defined a common set of bins, wherein the proportion of gradients occupying each bin of the set of bins defined for each histogram is aggregated to obtain an aggregated proportion for each bin, with a scale factor being derived by computing an aggregated proportion occupying bins above an overall threshold.
8. The method of claim 1, comprising storing at least a subset of the network weights, gradients and activations in computer memory in floating-point format.
9. The method of claim 8, comprising storing at least a subset of the network weights, gradients and activations in computer memory in eight-bit floating-point format.
10. The method of claim 8, comprising storing at least a subset of the network weights, gradients and activations in computer memory in sixteen-bit floating-point format.
11. The method of claim 8, comprising storing the subset of values in a floating-point format, and wherein the adjustment parameter is an exponent bias applied to the floating-point representations of the subset of weights, gradients and activations.
12. The method of claim 11, wherein the subset of values in the neural network is a subset of network weights and activations and the adjustment parameter is an exponent bias applied to the subset of values of the network weights and activations in the forward pass.
13. The method of claim 11, wherein a subset of network weights, activations and gradients which are inputs to compute operations in at least one of the forward and backward passes are stored in eight-bit floating-point format, the compute operations comprising at least one of a matrix operation and a convolution operation.
14. A computer system comprising one or more processors configured to train a multi-layer neural network comprising a set of network weights, and memory holding the network weights, the processor configured to train the neural network by: receiving a set of training data; processing the training data in respective forward and backward passes through a sequence of layers of the network, the forward pass comprising computing a set of activations by applying an activation function in dependence on the network weights and training data, and the backward pass comprising determining a set of gradients of a pre-determined loss function with respect to the weights and/or activations of the network, wherein an adjustment parameter is applied to at least a subset of values in the neural network, wherein the values on the forward pass comprise at least one of the network weights and computed activations, and the values on the backwards pass comprise the computed gradients with respect to activations and gradients with respect to weights; storing the values to memory; updating the network weights in dependence on the computed gradients with respect to the weights; on at least one of the forward and backward pass, computing a proportion of the subset of values falling above a predefined threshold; and updating the adjustment parameter applied to the subset of machine learning parameters in dependence on the computed proportion.
15. The computer system of claim 14, comprising a plurality of processors, wherein each processor is configured to process a respective subset of the training data.
16. The computer system of claim 15, wherein the adjustment parameter is updated in dependence on an aggregated proportion of values for all processors falling above a predefined threshold, the aggregated proportion computed by aggregating a computed proportion of the subset of values falling above the predefined threshold for each of the plurality of processors.
17. A non-transitory computer-readable storage medium storing computer program instructions which when executed perform a method of training, based on a set of training data, a multi-layer neural network comprising a set of network weights, the method comprising: processing the training data in respective forward and backward passes through a sequence of layers of the network, the forward pass comprising computing a set of activations by applying an activation function in dependence on the network weights and training data, and the backward pass comprising determining a set of gradients of a pre-determined loss function with respect to the weights and/or activations of the network, wherein an adjustment parameter is applied to at least a subset of values in the neural network, and wherein the values on the forward pass comprise at least one of the network weights and computed activations, and the values on the backwards pass comprise the computed gradients with respect to activations and gradients with respect to weights; updating the network weights in dependence on the computed gradients with respect to the weights; on at least one of the forward and backward pass, computing a proportion of the subset of values falling above a predefined threshold; and updating the adjustment parameter applied to the subset of machine learning parameters in dependence on the computed proportion.